
    Multiple Retrieval Models and Regression Models for Prior Art Search

    This paper presents PATATRAS (PATent and Article Tracking, Retrieval and AnalysiS), the system built for the IP track of CLEF 2009. Our approach has three main characteristics: 1. the use of multiple retrieval models (KL, Okapi) and term index definitions (lemma, phrase, concept) for the three languages considered in the track (English, French, German), producing ten different sets of ranked results; 2. the merging of the different results with multiple regression models, using an additional validation set created from the patent collection; 3. the exploitation of patent metadata and citation structures to create restricted initial working sets of patents and to produce a final re-ranking regression model. Since the specific metadata of the patent documents and the citation relations are exploited only when creating the initial working sets and during the final post-ranking step, our architecture remains generic and easy to extend.
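
    The merging step above lends itself to a short illustration. Below is a minimal sketch, assuming scores from the ten runs are already computed and normalised; the data shapes, the relevance labels and the choice of scikit-learn's LinearRegression are illustrative stand-ins, not the paper's actual configuration.

    ```python
    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)

    # Hypothetical validation set: one row per (topic, candidate patent) pair,
    # one column per retrieval run (KL/Okapi x lemma/phrase/concept indexes).
    X_val = rng.random((500, 10))                   # scores from the ten runs
    y_val = (rng.random(500) > 0.9).astype(float)   # 1.0 if a true citation

    # Fit a regression model mapping the vector of run scores to relevance.
    merger = LinearRegression().fit(X_val, y_val)

    # Merge: rank the candidates of a new topic by predicted relevance.
    X_new = rng.random((200, 10))
    ranking = np.argsort(-merger.predict(X_new))    # best candidates first
    ```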

    Simple vs. sophisticated approaches for patent prior-art search

    Patent prior-art search is concerned with finding all filed patents relevant to a given patent application. We report a comparison between two search approaches representing the state of the art in patent prior-art search. The first uses simple, straightforward information retrieval (IR) techniques, while the second uses much more sophisticated techniques that try to model the steps taken by a patent examiner. Experiments show that the retrieval effectiveness of the two techniques is statistically indistinguishable when patent applications contain some initial citations, whereas the advanced technique is statistically better when no initial citations are provided. Our findings suggest that when initial citations are provided, simple IR approaches suffice and save time and effort.
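
    As an illustration of the "simple IR" side of such a comparison, here is a minimal, self-contained BM25 scorer over a toy patent corpus; the corpus, query and parameter values are illustrative, and the paper's actual baseline may differ.

    ```python
    import math
    from collections import Counter

    def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
        """Score each tokenized document against the query with standard BM25."""
        N = len(docs)
        avgdl = sum(len(d) for d in docs) / N
        df = Counter(t for d in docs for t in set(d))   # document frequencies
        scores = []
        for d in docs:
            tf = Counter(d)
            s = 0.0
            for t in query_terms:
                if tf[t] == 0:
                    continue
                idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
                s += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
            scores.append(s)
        return scores

    corpus = [doc.split() for doc in [
        "rotor blade cooling channel",
        "battery cell thermal management",
        "rotor blade with internal cooling passage",
    ]]
    print(bm25_scores("rotor cooling".split(), corpus))  # docs 0 and 2 score highest
    ```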

    GROBID: combining Automatic Bibliographic Data Recognition and Term Extraction for Scholarship Publications

    Based on state-of-the-art machine learning techniques, GROBID (GeneRation Of BIbliographic Data) performs reliable bibliographic data extraction from scholarly articles, combined with multi-level term extraction. These two types of extraction present synergies and correspond to complementary descriptions of an article. The tool is intended as a component for enhancing existing and future large repositories of technical and scientific publications.
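
    For context, a GROBID service is typically queried over its REST API; the sketch below assumes a server running locally on the default port 8070, and article.pdf is a placeholder input.

    ```python
    import requests

    GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

    # "article.pdf" stands in for any scholarly PDF.
    with open("article.pdf", "rb") as pdf:
        resp = requests.post(GROBID_URL, files={"input": pdf}, timeout=60)

    resp.raise_for_status()
    tei_xml = resp.text   # TEI XML: header metadata, body structure, bibliography
    print(tei_xml[:500])
    ```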

    Using citation-context to reduce topic drifting on pure citation-based recommendation

    Recent work on academic recommender systems has demonstrated the effectiveness of co-citation and citation closeness for related-document recommendation. However, documents recommended by such systems may drift away from the main theme of the query document. In this work, we investigate whether incorporating the textual information in close proximity to a citation, as well as the citation position, can reduce such drift and further improve the recommender's performance. To investigate this, we run experiments with several recommendation methods on a newly created and now publicly available dataset containing 53 million unique citation-based records. We then conduct a user-based evaluation with domain-knowledgeable participants. Our results show that a new method combining Citation Proximity Analysis (CPA), topic modelling and word embeddings achieves more than 20% improvement in Normalised Discounted Cumulative Gain (nDCG) over CPA alone.
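
    A minimal sketch of the Citation Proximity Analysis idea underlying the compared methods: documents whose in-text citations appear close together are taken to be related. The citing-paper data and the 1/distance weighting below are illustrative; the exact scheme used in the paper may differ.

    ```python
    from collections import defaultdict
    from itertools import combinations

    # Hypothetical citing paper: (cited document id, token position of the citation).
    citations = [("docA", 120), ("docB", 128), ("docC", 890), ("docA", 902)]

    # Accumulate a proximity weight for every pair of distinct cited documents.
    cpa = defaultdict(float)
    for (d1, p1), (d2, p2) in combinations(citations, 2):
        if d1 != d2:
            cpa[frozenset((d1, d2))] += 1.0 / max(abs(p1 - p2), 1)

    # Recommend the documents most strongly co-cited with a query document.
    query = "docA"
    related = sorted(
        ((next(iter(pair - {query})), w) for pair, w in cpa.items() if query in pair),
        key=lambda x: -x[1])
    print(related)   # docB first: it is cited right next to a docA citation
    ```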

    Representing and Using the Constraints of Spoken Language with a Lexicalized Tree-Adjoining Grammar

    This article addresses the problem of parsing incomplete spoken utterances in the context of human-machine dialogue. Minimal constraints of spoken language can nevertheless be exploited in order to remain predictive in the face of ellipsis phenomena. We propose an enrichment of the LTAG formalism to capture these constraints and to adapt to speech a grammar initially designed for written language.

    GRISP: A Massive Multilingual Terminological Database for Scientific and Technical Domains

    Developing a multilingual terminology is a very long and costly process. We present the creation of GRISP, a multilingual terminological database covering multiple technical and scientific fields, built from various open resources. A crucial aspect is the merging of the different resources, which in our proposal is based on the definition of a sound conceptual model, mappings between domains, and the use of structural constraints and machine learning techniques to control the fusion process. The result is a massive terminological database of several million terms, concepts, semantic relations and definitions. This resource has allowed us to significantly improve the mean average precision of an information retrieval system applied to a large collection of multilingual and multidomain patent documents.
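
    As a rough illustration of the fusion idea, the sketch below merges term records from two hypothetical resources into one concept when a normalised form and a mapped domain agree; the real process described above also involves structural constraints and machine learning, which are not modelled here.

    ```python
    from collections import defaultdict

    # Hypothetical mapping between the resources' own domain labels.
    DOMAIN_MAP = {"chem": "chemistry", "chimie": "chemistry", "mech": "engineering"}

    records = [
        {"term": "Polymerization", "lang": "en", "domain": "chem",   "src": "resourceA"},
        {"term": "polymerization", "lang": "en", "domain": "chimie", "src": "resourceB"},
        {"term": "gearbox",        "lang": "en", "domain": "mech",   "src": "resourceA"},
    ]

    # Merge records into one concept when normalised form and domain agree.
    concepts = defaultdict(list)
    for r in records:
        concepts[(r["term"].lower(), DOMAIN_MAP[r["domain"]])].append(r)

    for key, entries in concepts.items():
        print(key, [e["src"] for e in entries])   # the two polymerization records merge
    ```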

    A Framework for Multi-level Linguistic Annotation

    This article presents a 3-step model for multi-layer annotation of corpora. Each kind of annotation of a textual corpus corresponds to a different view on the same document. This principle is first expressed with a general relational model dedicated to the organisation of language resources. This abstract model is then implemented as an application of the XML formalism for the encoding of large corpora. Exploiting this kind of annotated corpus requires efficient manipulation processes and reversible access. We propose a third-step representation based on a set of optimised finite-state automata (FSA) resulting from the parsing of the XML documents. These proposals have been implemented in the first version of a workbench dedicated to the French Le Monde corpus.
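
    A minimal sketch of the underlying stand-off principle, in which each annotation layer is a separate view pointing into the same base text; the dataclass encoding is illustrative, not the article's relational model or XML format.

    ```python
    from dataclasses import dataclass

    @dataclass
    class Span:
        start: int   # character offset into the shared base text
        end: int
        label: str

    text = "Le Monde est un journal."

    # Each layer is an independent view pointing into the same document.
    layers = {
        "pos":      [Span(0, 2, "DET"), Span(3, 8, "PROPN"), Span(9, 12, "VERB"),
                     Span(13, 15, "DET"), Span(16, 23, "NOUN")],
        "entities": [Span(0, 8, "ORG")],   # "Le Monde"
    }

    for name, spans in layers.items():
        print(name, [(text[s.start:s.end], s.label) for s in spans])
    ```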

    HUMB: Automatic Key Term Extraction from Scientific Articles in GROBID

    The SemEval task 5 was an opportunity to experiment with the key term extraction module of GROBID, a system for extracting and generating bibliographical information from technical and scientific documents. The tool first uses GROBID's facilities for analyzing the structure of scientific articles, resulting in a first set of structural features. A second set of features captures content properties based on phraseness, informativeness and keywordness measures. Two knowledge bases, GRISP and Wikipedia, are then exploited to produce a last set of lexical/semantic features. Bagged decision trees appeared to be the most efficient machine learning algorithm for generating a list of ranked key term candidates. Finally, a post-ranking was performed based on statistics of the co-usage of keywords in HAL, a large open access publication repository.
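
    The candidate-ranking step can be illustrated directly, since scikit-learn's BaggingClassifier bags decision trees by default. The three features and the labels below are illustrative stand-ins for the structural, content and lexical/semantic feature sets described above.

    ```python
    import numpy as np
    from sklearn.ensemble import BaggingClassifier  # bags decision trees by default

    rng = np.random.default_rng(1)
    X = rng.random((300, 3))        # e.g. [phraseness, informativeness, keywordness]
    y = (X.sum(axis=1) > 1.8).astype(int)   # hypothetical "is a key term" labels

    model = BaggingClassifier(n_estimators=50).fit(X, y)

    # Rank fresh candidates by their predicted probability of being a key term.
    candidates = rng.random((10, 3))
    proba = model.predict_proba(candidates)[:, 1]
    print(np.argsort(-proba))       # candidate indices, best first
    ```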

    A Contribution to Robust Non-Deterministic Parsing for Spoken Dialogue Systems

    We present a robust parsing technique intended to back up the decisions of a speech recognition system. The proposed parsing strategy is based on a compacted lexicalized tree-adjoining grammar and on putting the different hypotheses of the speech recognizer in competition. Robustness issues are studied by considering the interference between speech recognition errors and spontaneous speech phenomena in human-machine dialogues.
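
    A loose illustration of putting recognizer hypotheses in competition: each n-best hypothesis is rescored by combining its ASR confidence with how much of it a grammar can account for. Everything below, including the "grammar" reduced to a toy lexicon, is an illustrative stand-in for the paper's compacted LTAG parser.

    ```python
    # Toy stand-ins: a "grammar" reduced to a lexicon, and a coverage score.
    LEXICON = {"je", "veux", "un", "billet", "pour", "paris"}

    def coverage(words):
        """Fraction of the hypothesis the toy grammar can account for."""
        return sum(w in LEXICON for w in words) / len(words)

    nbest = [                        # (hypothesis, ASR confidence)
        ("je veux un billet pour paris", 0.61),
        ("je veux un mille et pour paris", 0.64),
    ]

    best = max(nbest, key=lambda h: 0.5 * h[1] + 0.5 * coverage(h[0].split()))
    print(best[0])   # the fully parsable hypothesis wins despite a lower ASR score
    ```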

    Synchronized Automata for Integrating Speech Recognition and Natural Language Understanding Techniques

    We present an architecture whose objective is a tight integration of the different levels of spoken language processing. Static knowledge is represented as finite-state automata, allowing optimal sharing of common substructures. These automata are used to implement stochastic, tabular parsing in order to take into account the non-determinism of the different processing levels. Synchronization functions are applied to these automata in order to propagate constraints between levels. The resulting architecture separates symbolic and probabilistic knowledge, interfaces between analysis levels, and control. Experiments based on these principles and representations are currently under way to integrate an analytic segmentation system, a stochastic phonetic recognition module, and a parser based on synchronous LTAGs.
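
    The substructure-sharing idea can be sketched with a toy lexicon automaton in which words sharing a prefix share states and transitions; the dict-of-dicts encoding is illustrative, and the synchronization functions between levels are not modelled here.

    ```python
    def add(automaton, word):
        """Insert a word, reusing existing transitions for shared prefixes."""
        state = automaton
        for ch in word:
            state = state.setdefault(ch, {})
        state["<final>"] = True

    def accepts(automaton, word):
        state = automaton
        for ch in word:
            if ch not in state:
                return False
            state = state[ch]
        return "<final>" in state

    lexicon_fsa = {}
    for w in ["billet", "billets", "bille"]:
        add(lexicon_fsa, w)          # the three words share the path b-i-l-l-e

    print(accepts(lexicon_fsa, "bille"), accepts(lexicon_fsa, "bill"))  # True False
    ```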